Capgemini | Data Engineer Interview Experience | 6 YoE



Round 1: Technical Screening (SQL + Pyspark)

Given below 2 tables:

How many records will come up upon performing - inner join, outer join, left join, right join?

Given an employees file having columns:

employee_id, employee_name, employee_salary, employee_dept, employee_manager

Write a custom schema to read this file, given that separator is '|

First 4 lines in the file are blank, 5th line is header. How would you read it?

Suppose the values for column employee_name is of the format -> "Himani H Sharma". Derive three columns from it - first_name, middle_name, last_name

Find all employees who have salaries greater than their manager's salary.

Find the third highest salary in the whole table, and then per department.

What's the difference between persist() & cache()?

What's the difference between map() & flatMap()?

What kind of AWS services have you used at your job?

Round 2: Client Interview Round (Pyspark + SQL)

What kind of projects have you worked on so far? Describe the architecture along with full/incremental load strategies, and how data validation is performed.

You are working on a data pipeline that processes user activity logs from a mobile app. The logs are stored in JSON format in a data lake (e.g., S3 or ADLS). Each record contains the following fields:

user_id (string)

event_type (string: e.g., "login", "purchase", "logout")

event_timestamp (timestamp)

device_type (string: e.g., "iOS", "Android")

1. Read the JSON file and filter out all the "purchase" event types.

2. For each user, find out the difference between maximum & minimum purchase timestamps.

Below table has users & mobile numbers:

Write Spark Program to generate count of distinct mobile numbers for each user.

Expected output:

Given tables - Customers (id, name) & Orders (orderid, customer id, amount)

Return top 3 customers by total spending along with names and total spent.

Interviewer instruction: Use CTE, join & window function to solve the problem.

Given a table having below cricket teams:

Find out all possible combinations of match fixtures without any repetitions.

For example - Ind vs NZ & NZ vs Ind are considered to be the same match fixture. Only one combination of this should be in the output.

In windowing functions, which clause is mandatory to have? partition by or order by?

How would you determine the number of jobs, tasks & stages given a sequence of transformations like - reading a file, applying union transformation, group by followed by writing the output to a table.

How would 10 GB of data be processed in spark?

Is read() an action or transformation?

Are you aware of SCD? Describe SCD-1, SCD-2 & SCD-3, and whether you've implemented them at your job.

Round 3: HR round

Salary negotiation, job location preference discussion.


Cracked all rounds, got the job offer!